Overview

Give a brief a description of your project and its goal(s), what data you are using to complete it, and what three faculty/staff in different fields you have spoken to about your project with a brief summary of what you learned from each person. Include a link to your final project GitHub repository.

This project aims at identifying predictors of health outcomes using socioeconomic factors. Specifically, income, housing costs, food security, and food costs will be tested for association with surveyed outocmes from Behavioral Risk Factor Surveillance System (BRFSS) epidemiologic data. BRFSS data is tracked at the metropolitan level, and SMART BRFSS 2017 data was downloaded as XPT file here Measures of mental and physical health will be analyzed using secondary data from the American Community Survey (ACS), Housing Urban Development (HUD) fair market rent (fmr) data, Feeding America dataset. The main faculty for this project is Blanca Himes and Sherry Xie, who assisted with geospatial plotting and formatting.

Introduction

Describe the problem addressed, its significance, and some background to motivate the problem.

Do aggregate factors compare well to individual factors? This would be particularly useful since census data is extensive.

Explain why your problem is interdisciplinary, what fields can contribute to its understanding, and incorporate background related to what you learned from meeting with faculty/staff.

Methods

Describe the data used and general methodological approach. Subsequently, incorporate full R code necessary to retrieve and clean data, and perform analysis. Be sure to include a description of code so that others (including your future self) can understand what you are doing and why. Use 5-year ACS data from 2014-2018 for poverty rate and median income. Collect HUD fmr rate for the past from 2017-2020. Finally I use the “Map the Gap” calculated county-level meal cost to estimate food expense. The cost-of-food index is calculated by applying county-level sales tax rates to Nielsen basket-of-goods based on actual pounds purchased. This value was last calculated in 2010 by Feeding America. Necessary corrections to nominal housing costs for inflation were corrected using CPI values from 2010-2020. BRFSS data is ta

factors: poverty rate, median income, median income spent on food, median income spent on rent, race, sex, education level. Look into high versus low cost-of-living metropolitan areas.

Exploratory Analyses

Download and install Rtools40 to use the BLS api: https://cran.r-project.org/bin/windows/Rtools/

Reading and Cleaning Datasets

## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
## 
##     extract
## #Uighur
## Loading required package: usethis

Query CPI data from BLS

##   year period periodName   value signature_monthly state                      area
## 1 2020    M10    October 260.892       CUURS35ESA0    MD Baltimore-Columbia-Towson
## 2 2020    M08     August 259.336       CUURS35ESA0    MD Baltimore-Columbia-Towson
## 3 2020    M06       June 257.942       CUURS35ESA0    MD Baltimore-Columbia-Towson
## 4 2020    M04      April 258.978       CUURS35ESA0    MD Baltimore-Columbia-Towson
## 5 2020    M02   February 259.121       CUURS35ESA0    MD Baltimore-Columbia-Towson
## 6 2019    M12   December 257.847       CUURS35ESA0    MD Baltimore-Columbia-Towson
##   year period periodName   value signature_annual state                          area
## 1 2020    S01   1st Half 258.891      CUUSS35ESA0    MD     Baltimore-Columbia-Towson
## 2 2019    S02   2nd Half 257.288      CUUSS35ESA0    MD     Baltimore-Columbia-Towson
## 3 2019    S01   1st Half 256.485      CUUSS35ESA0    MD     Baltimore-Columbia-Towson
## 4 2018    S02   2nd Half 254.382      CUUSS35ESA0    MD     Baltimore-Columbia-Towson
## 5 2018    S01   1st Half 252.401      CUUSS35ESA0    MD     Baltimore-Columbia-Towson
## 6 2020    S01   1st Half 244.889      CUUSS35CSA0    GA Atlanta-Sandy Springs-Roswell
## # A tibble: 6 x 5
##   area                          year  `1st Half` `2nd Half` annual
##   <chr>                         <chr>      <dbl>      <dbl>  <dbl>
## 1 Baltimore-Columbia-Towson     2020        259.        NA    259.
## 2 Baltimore-Columbia-Towson     2019        256.       257.   257.
## 3 Baltimore-Columbia-Towson     2018        252.       254.   253.
## 4 Atlanta-Sandy Springs-Roswell 2020        245.        NA    245.
## 5 Atlanta-Sandy Springs-Roswell 2019        242.       246.   244.
## 6 Atlanta-Sandy Springs-Roswell 2018        238.       239.   239.
## # A tibble: 6 x 15
##   area                           year October August  June April February December September  July   May March January November annual
##   <chr>                         <dbl>   <dbl>  <dbl> <dbl> <dbl>    <dbl>    <dbl>     <dbl> <dbl> <dbl> <dbl>   <dbl>    <dbl>  <dbl>
## 1 Baltimore-Columbia-Towson      2020    261.   259.  258.  259.     259.      NA         NA    NA    NA    NA      NA       NA   259.
## 2 Baltimore-Columbia-Towson      2019    258.   257.  257.  259.     254.     258.        NA    NA    NA    NA      NA       NA   257.
## 3 Baltimore-Columbia-Towson      2018    255.   255.  254.  252.     252.     253.        NA    NA    NA    NA      NA       NA   254.
## 4 Atlanta-Sandy Springs-Roswell  2020    249.   248.  245.  243.     247.      NA         NA    NA    NA    NA      NA       NA   246.
## 5 Atlanta-Sandy Springs-Roswell  2019    246.   246.  243.  243.     240.     245.        NA    NA    NA    NA      NA       NA   244.
## 6 Atlanta-Sandy Springs-Roswell  2018    239.   241.  240.  237.     237      237.        NA    NA    NA    NA      NA       NA   239.

Now let’s visualize economic growth in various metropolitan areas around the US

## `summarise()` ungrouping output (override with `.groups` argument)

Query Fair market rent (FMR) data from HUD

## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
## Loading required package: grid
library(stringr)
library(grid)

# function to format the geo id column of FMR dataframes
format_id <- function(df){
  
  df$ID <- str_replace(as.character(df$fips2010),
                       as.character(df$CouSub),"")
  df$ID <- as.numeric(df$ID) + 50000000000
  df$ID <- paste0("0",as.character(df$ID))
  
  return(df)
}

# read in the housing costs: fair market rates (2017-2020)
# format relevent columns into a standard form
fmr_2017 <- read.csv("FY2017_50_County_FMR.csv", fileEncoding = "UTF-8-BOM") %>%
  dplyr::mutate_at(c("fmr0", "fmr1", "fmr2", "fmr3", "fmr4"), as.numeric)
fmr_2018 <- read.csv("FY2018_50_County_FMR.csv", fileEncoding = "UTF-8-BOM") %>%
  dplyr::mutate_at(c("fmr_0", "fmr_1", "fmr_2", "fmr_3", "fmr_4"), as.numeric) %>%
  dplyr::rename("CouSub" = "cousub", "fmr0" = "fmr_0","fmr1" = "fmr_1","fmr2" = "fmr_2", 
         "fmr3" = "fmr_3", "fmr4" = "fmr_4", "Metro_code" = "metro_code")
fmr_2019 <- read.csv("FY2019_50_County_FMR.csv", fileEncoding = "UTF-8-BOM") %>%
  dplyr::mutate_at(c("fmr_0", "fmr_1", "fmr_2", "fmr_3", "fmr_4"), as.numeric) %>%
  dplyr::rename("CouSub" = "cousub","fmr0" = "fmr_0","fmr1" = "fmr_1","fmr2" = "fmr_2", 
         "fmr3" = "fmr_3", "fmr4" = "fmr_4", "Metro_code" = "metro_code") 
fmr_2020 <- read.csv("FY2020_50_County_FMR.csv", fileEncoding = "UTF-8-BOM") %>%
  dplyr::mutate_at(c("fmr_0", "fmr_1", "fmr_2", "fmr_3", "fmr_4"), as.numeric) %>%
  dplyr::rename("CouSub" = "cousub", "fmr0" = "fmr_0","fmr1" = "fmr_1","fmr2" = "fmr_2", 
         "fmr3" = "fmr_3", "fmr4" = "fmr_4", "Metro_code" = "metro_code") 
fmr_2021 <- read.csv("FY2021_50_County_FMR.csv", fileEncoding = "UTF-8-BOM") %>%
  dplyr::mutate_at(c("rent50_0", "rent50_1", "rent50_2", "rent50_3", "rent50_4"),
            as.numeric) %>%
  dplyr::rename("CouSub" = "cousub", "fmr0" = "rent50_0",
         "fmr1" = "rent50_1","fmr2" = "rent50_2", 
         "fmr3" = "rent50_3", "fmr4" = "rent50_4",
         "Metro_code" = "cbsasub21", "areaname" = "areaname21",
         "countyname" = "cntyname") 

# read in shape file of US counties
counties <- readRDS(gzcon(url('https://raw.githubusercontent.com/HimesGroup/BMIN503/master/DataFiles/uscounties_2010.rds'))) # decompress  rds file and read in shape file
counties$GEO_ID <- as.character(counties$GEO_ID)#change geo id to characters in counties df
counties$ID <- gsub("US","", counties$GEO_ID)  #remove US from ID

# read in metro area shape file and convert into spatial components
setwd("tl_2017_us_cbsa")
cbsa.sf <- st_read("tl_2017_us_cbsa.shp")
## Reading layer `tl_2017_us_cbsa' from data source `C:\Users\chana\Box Sync\Graduate School\Courses\Fall 2020\BMIN 503\Final Project\BMIN503_Final_Project\tl_2017_us_cbsa\tl_2017_us_cbsa.shp' using driver `ESRI Shapefile'
## Simple feature collection with 945 features and 12 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -178.4436 ymin: 17.83151 xmax: -65.42271 ymax: 65.45352
## geographic CRS: NAD83
cbsa.sf %<>%
  dplyr::select(GEOID, geometry,NAME, NAMELSAD) %>%
  dplyr::rename(ID2 = GEOID, Metro = NAMELSAD)


# function to merge FMR dataframes with county geometries 
geo_format <- function(geo_df){
  fmr_geo <- counties %>%
    dplyr::select(c("NAME", "ID", "geometry")) %>%
    left_join(geo_df[,c("ID","fmr0", "fmr1", "fmr2", "fmr3", "fmr4",
                        "Metro_code", "areaname", "countyname")], by="ID") %>%
    distinct(Metro_code, .keep_all = TRUE)
}

# function to merge FMR dataframes with metro area geometries 
geo_format2 <- function(geo_df){
  
  geo_df %<>%
    dplyr::select(ID,fmr0,fmr1,fmr2, fmr3, fmr4, Metro_code, areaname, countyname)%>%
    distinct(Metro_code, .keep_all = TRUE) %>% # removed repeated counties
    dplyr::mutate(ID2 = substr(Metro_code, 12,16)) # metro IDs
  
  fmr_geo <- cbsa.sf %>%
    dplyr::select(c("NAME", "ID2", "geometry", "Metro")) %>%
    left_join(geo_df, by="ID2") # join by metro area

}

# format geographical IDs for all FMRs 2017-20201
fmr_2017 <- format_id(fmr_2017)
fmr_2018 <- format_id(fmr_2018)
fmr_2019 <- format_id(fmr_2019)
fmr_2020 <- format_id(fmr_2020)
fmr_2021 <- format_id(fmr_2021)

# add county geographical data to all FMRs 2017-20201
fmr_2017_geo <- geo_format(fmr_2017)
fmr_2018_geo <- geo_format(fmr_2018)
fmr_2019_geo <- geo_format(fmr_2019)
fmr_2020_geo <- geo_format(fmr_2020)
fmr_2021_geo <- geo_format(fmr_2021)

# add metro area geographical data to all FMRs 2017-20201
fmr_2017_geo2 <- geo_format2(fmr_2017)
fmr_2018_geo2 <- geo_format2(fmr_2018)
fmr_2019_geo2 <- geo_format2(fmr_2019)
fmr_2020_geo2 <- geo_format2(fmr_2020)
fmr_2021_geo2 <- geo_format2(fmr_2021)
library(RColorBrewer)
library(leaflet)

my_theme <- function() {
  theme_minimal() +                                  # shorthand for white background color
  theme(axis.line = element_blank(),                 # further customization of theme components
        axis.text = element_blank(),                 # remove x and y axis text and labels
        axis.title = element_blank(),
        panel.grid = element_line(color = "white"),  # make grid lines invisible
        legend.key.size = unit(0.8, "cm"),           # increase size of legend
        legend.text = element_text(size = 16),       # increase legend text size
        legend.title = element_text(size = 16),
        plot.title = element_text(hjust = 0.5))      # increase legend title size
}

myPalette <- colorRampPalette(brewer.pal(9, "BuPu"))    # Blue-Purple
myPalette2 <- colorRampPalette(brewer.pal(9, "YlOrRd")) # Yellow-Orange-Red
myPalette3 <- colorRampPalette(brewer.pal(9, "YlGnBu")) # Yellow-Green-Blue

# geographic visualizations of housing costs
cpi_by_year <- read.csv("annual_cpi_us.csv", fileEncoding = "UTF-8-BOM")
cpi_by_year$adjust <- cpi_by_year$cpi/258.2937 # adjustment factor into 2020 dollars

# use the cpi_by_year adjustment to adjust rents to 2010 dollars
# rent in y year = (rent in current year) X (CPI y year/CPI current year)

# Select a color palette with which to run the palette function
pal_fun  <- colorNumeric("BuPu", NULL)      # Blue-Purple from RColorBrewer
pal_fun2 <- colorNumeric("YlOrRd", NULL)    # Yellow-Orange-Red from RColorBrewer

# function to get inflation correction factor for a given year into 2020 dollars
i_factor <- function(year){
  c_factor <- cpi_by_year$adjust[which(cpi_by_year$year == year)]
  return(c_factor)
}

# function to generate static map of rent costs by year across all coutnies in US
map_rent <- function(df, year){
    correction = i_factor(year)
 
  fmr_map = ggplot() +
    geom_sf(data=df,lwd=0, 
            aes(fill = fmr1/correction)) + # adjust factor included
    ggtitle(paste(year, "Median Rent for 1-bed")) + 
    scale_fill_gradientn(name = "Rent (2020 $)", 
                         colours = myPalette2(100),
                         limits = c(0,2000),
                         breaks = c(500,1000,1500,2000)) + # add limits and breaks to normalize scaling between years  
    north(data = df, symbol = 12, location = "bottomright") +
    scalebar(data = df, dist = 500, dist_unit = 'mi', transform = TRUE,
             location = "bottomleft",
             st.dist = 0.1) +
    my_theme()
}

# call static map function and display
fmr1_2017_map <- map_rent(fmr_2017_geo, 2017) # generate static rent map for counties
fmr1_2017_map # display rent map

# an interactive map of rent prices
library(htmlwidgets)
library(htmltools)

pal_fun <- colorNumeric("YlOrRd", NULL)  # Yellow-Orange-Red from RColorBrewer


# ------------------------------------------------- #
# HTML code to fix misaligned "na" in map legend
# ------------------------------------------------- #
library(htmlwidgets)
css_fix <- "div.info.legend.leaflet-control br {clear: both;}" # CSS to correct spacing
html_fix <- htmltools::tags$style(type = "text/css", css_fix)  # Convert CSS to HTML


# Funciton allows to plot interactive leaflet maps of median US rents by year
map_rent_i <- function(df, year){

#---------------------------------------------------#
# HTML code to add title to interactive maps
#---------------------------------------------------#
tag.map.title <- tags$style(HTML("
.leaflet-control.map-title { 
    transform: translate(-50%,20%);
    position: fixed !important;
    left: 50%;
    text-align: center;
    padding-left: 10px; 
    padding-right: 10px; 
    background: rgba(255,255,255,0.75);
    font-weight: bold;
    color: #000000;
    font-size: 20px;
  }
"))

title <- tags$div(
  tag.map.title, HTML(paste(year, "Median Rent:", 1, "Bed"))
)  
    
  correction = i_factor(year)# find correction factor
  
  # Pop-up message
  pu_message <- paste0("County: ", df$countyname,  
                     "<br>Rent: ", "$", floor(df$fmr1/correction)) # in 2010 dollars
  
  # Adding more customization 
  interactive_map <- leaflet(df) %>%
    addPolygons(stroke = FALSE,                        # remove polygon borders
                fillColor = ~pal_fun(fmr1/correction), # in 2010 dollars
                fillOpacity = 0.5, smoothFactor = 0.5, # increase opacity and resolution
                popup = ~pu_message) %>%
    addProviderTiles(providers$CartoDB.Positron) %>%   # add third party provider tile
    #addProviderTiles(providers$Stamen.Toner) %>%
    #addProviderTiles(providers$Esri.NatGeoWorldMap)
    addLegend("bottomright",                           # location of legend
              pal=pal_fun,                             # palette function
              values=~fmr1/correction,                  # variable to be passed to palette function
            title = paste("Rent",1,"bed (2020 $)"),  # legend title
            opacity = 1) %>% # legend opacity (1 = completely opaque)
    addScaleBar() %>%
    addTiles() %>%
    addControl(title, position = "topleft", className="map-title") #add title panel
  
  return(interactive_map %<>% htmlwidgets::prependContent(html_fix))
}

# Call interactive map function and display 
fmr_2020_mapi <- map_rent_i(fmr_2020_geo, 2020) 
fmr_2020_mapi

ACS poverty and income data

## Getting data from the 2014-2018 5-year ACS

# calculate poverty rate and percent spent on housing by county for one year

thresh <- 0 # threshold for filtering poverty estimates

acs.fmr.2017 %<>%
  dplyr::filter(poverty_estimate > thresh) %>%
  dplyr::mutate(poverty_percent = poverty_estimate/poverty_total) %>%
  dplyr::mutate(housing_percent = (fmr1*12)/med_house_income)
acs.fmr.2018 %<>%
  dplyr::filter(poverty_estimate > thresh) %>%
  dplyr::mutate(poverty_percent = poverty_estimate/poverty_total) %>%
  dplyr::mutate(housing_percent = (fmr1*12)/med_house_income)
acs.fmr.2019 %<>%
  dplyr::filter(poverty_estimate > thresh) %>%
  dplyr::mutate(poverty_percent = poverty_estimate/poverty_total) %>%
  dplyr::mutate(housing_percent = (fmr1*12)/med_house_income)
acs.fmr.2020 %<>%
  dplyr::filter(poverty_estimate > thresh) %>%
  dplyr::mutate(poverty_percent = poverty_estimate/poverty_total) %>%
  dplyr::mutate(housing_percent = (fmr1*12)/med_house_income)


# plot map of percentage spent on median 1-bed apartment by median income
rent_income_map <- ggplot() +
  geom_sf(data=acs.fmr.2017,lwd=0, 
          aes(fill = housing_percent)) + # adjust factor included
  ggtitle("Percent of income spent on rent (2017 1-Bed)") + 
  scale_fill_gradientn(name = "% spent on rent", 
                       colours = myPalette2(100)) + # add limits and breaks to normalize scaling between years  
  north(data = acs.fmr.2017, location = "bottomright", symbol = 12) +
  scalebar(data = acs.fmr.2017, dist = 500, dist_unit = 'mi', transform = TRUE,
           location = "bottomleft",
           st.dist = 0.1) +
  my_theme()

rent_income_map # show the median income map

mapthegap data

## Warning: NAs introduced by coercion

# generate a two-factor cost of living index by adding food and housing costs

acs.fmr.food.2017 %<>%
  mutate(COL = meal_percent + housing_percent)

acs.fmr.food.2018 %<>%
  mutate(COL = meal_percent + housing_percent)

acs.fmr.food.2019 %<>%
  mutate(COL = meal_percent + housing_percent)

acs.fmr.food.2020 %<>%
  mutate(COL = meal_percent + housing_percent)

COL_2017 <- ggplot(acs.fmr.food.2017, aes(x=COL, y=poverty_percent))+
  geom_point()+  
  geom_smooth(method = "lm", formula = y~x) +
  xlab("Combined food and housing costs") + 
  ylab("Poverty Rate %") + 
  ggtitle("2017 US Counties ") +
  theme(plot.title = element_text(hjust = 0.5)) +
  stat_regline_equation() 

COL_2018 <- ggplot(acs.fmr.food.2018, aes(x=COL, y=poverty_percent))+
  geom_point()+  
  geom_smooth(method = "lm", formula = y~x) +
  xlab("Combined food and housing costs") + 
  ylab("Poverty Rate %") + 
  ggtitle("2018 US Counties ") +
  theme(plot.title = element_text(hjust = 0.5)) +
  stat_regline_equation() 

COL_2019 <- ggplot(acs.fmr.food.2019, aes(x=COL, y=poverty_percent))+
  geom_point()+  
  geom_smooth(method = "lm", formula = y~x) +
  xlab("Combined food and housing costs") + 
  ylab("Poverty Rate %") + 
  ggtitle("2019 US Counties ") +
  theme(plot.title = element_text(hjust = 0.5)) +
  stat_regline_equation() 

COL_2020 <- ggplot(acs.fmr.food.2020, aes(x=COL, y=poverty_percent))+
  geom_point()+  
  geom_smooth(method = "lm", formula = y~x) +
  xlab("Combined food and housing costs") + 
  ylab("Poverty Rate %") + 
  ggtitle("2020 US Counties ") +
  theme(plot.title = element_text(hjust = 0.5)) +
  stat_regline_equation() 

combined_COL <- cowplot::plot_grid(COL_2017,COL_2018,COL_2019,COL_2020) 

combined_COL # display the combined graph

Based on these analyses, poverty rate and income are directly correlated as expected. However, median housing cost and food expenses alone do not seem to be directly associated with the poverty rate. Furthermore, calculated variables such as percent-spent-on-housing and percent-spent-on-food do not seem to give better correlations to poverty rate compared to median income alone, From these results I will use housing costs, food costs, and either poverty rate or median income from aggreate county data to predict health outcomes from BRFSS data.

## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following object is masked from 'package:ggpubr':
## 
##     mutate
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise, summarize

Get BRFSS data

# select variables of interest
# clean BRFSS 2017 dataframe

brfss2017 <- brfss2017_raw %>%
  dplyr::select('MMSANAME', 'X_MMSA', 'X_RACE', 'SEX', 'X_AGEG5YR','EDUCA','INCOME2',
                'ADDEPEV2', 'CHCCOPD1', 'X_RFHLTH') %>%
  dplyr::rename('Metro Area'='MMSANAME','Metro_ID'='X_MMSA','age'='X_AGEG5YR','race'='X_RACE', 
         'sex'='SEX', 'education'='EDUCA','income'='INCOME2',
         'depression'='ADDEPEV2','copd'='CHCCOPD1', 'health'='X_RFHLTH') %>%
  dplyr::filter(race %in%  c(1,2,3,5,8)) %>%
  dplyr::filter(sex %in% c(1,2)) %>%
  dplyr::filter(age %in% seq(13)) %>%
  dplyr::filter(education %in%  c(1,2,3,4,5,6)) %>%
  dplyr::filter(income %in% c(1,2,3,4,5,6,7,8)) %>%
  dplyr::filter(depression %in%  c(1,2)) %>%
  dplyr::filter(copd %in% c(1,2)) %>%
  dplyr::filter(health %in% c(1,2))

brfss2017 %<>%
  dplyr::mutate('Metro_ID' = as.character(brfss2017$'Metro_ID')) %>%
  dplyr::mutate(race = factor(race, levels = c(1,2,3,5,8), 
                        labels = c( "White", "Black", "American Indian", 
                                    "Asian", "Hispanic"))) %>%
  dplyr::mutate(sex = factor(sex, levels = c(1,2), 
                        labels = c("Male", "Female"))) %>%
  
  dplyr::mutate(age = factor(age, levels = seq(13), 
                        labels = c("18-24","25-29","30-34","35-39","40-44","45-49","50-54",
                                   "55-59","60-64","65-69","70-74","75-29",">79"))) %>%
  dplyr::mutate(education = factor(education, levels = c(1,2,3,4,5,6), 
                        labels = c("None","Elementary", "Some HS", "HS",
                                   "Some College", "College"))) %>%
  dplyr::mutate(income = factor(income, levels = c(1,2,3,4,5,6,7,8), 
                        labels = c("10k","15k", "20k", "25k", "35k", "50k", "75k","> 75k"))) %>%
  dplyr::mutate(depression = factor(depression, levels = c(1,2), 
                        labels = c("yes", "no"))) %>%
  dplyr::mutate(copd = factor(copd, levels = c(1,2), 
                        labels = c("yes", "no"))) %>%
  dplyr::mutate(health = factor(health, levels = c(1,2), 
                        labels = c("good", "poor"))) 

Let’s look at the summary of the clean 2017 data frame. I am interested in sex, age, income, education, and race as demographic factors. I am interested in previous depression, COPD, and general health as outcomes. From the summary, there is a bias in sex and age. This might be explained by the fact that BRFSS is a phone based survey. It seems that income and education is skewed twoards highly educated, high earning individuals as well. The BRFSS might be biased towards home-owners and landline users over cell-phone users.

##   Metro Area          Metro_ID                      race            sex           age               education       income      depression    copd         health      
##  Length:176937      Length:176937      White          :138185   Male  :79948   18-24: 9372   None        :  210   10k  : 7348   yes: 35853   yes: 13249   good:146650  
##  Class :character   Class :character   Black          : 18934   Female:96989   25-29: 9552   Elementary  : 3202   15k  : 7613   no :141084   no :163688   poor: 30287  
##  Mode  :character   Mode  :character   American Indian:  1806                  30-34:10675   Some HS     : 6744   20k  :11613                                          
##                                        Asian          :   385                  35-39:11531   HS          :41713   25k  :14631                                          
##                                        Hispanic       : 17627                  40-44:11104   Some College:48317   35k  :17058                                          
##                                                                                45-49:13113   College     :76751   50k  :23933                                          
##                                                                                50-54:15770                        75k  :27922                                          
##                                                                                55-59:18469                        > 75k:66819                                          
##                                                                                60-64:19574                                                                             
##                                                                                65-69:19402                                                                             
##                                                                                70-74:15710                                                                             
##                                                                                75-29:10526                                                                             
##                                                                                >79  :12139

Data Wrangling to apply an appropriate metro code to every county

Look at how aggregate costs/income data and individual income data affect health outcomes

Rent effects on health outcomes

aggregate food expenses and health outcomes

Aggregate poverty effect on health outcomes

Individual income associations with health visualizations

Now we will run chi-square tests to see if income is indepdent from chosen

health outcomes.

##       
##          10k   15k   20k   25k   35k   50k   75k > 75k
##   good  3024  3345  5988  8708 10974 16542 20359 47235
##   poor  2467  2603  3159  3343  3118  3340  2690  3325
## 
##  Pearson's Chi-squared test
## 
## data:  brfss2017_health_table
## X-squared = 13558, df = 7, p-value < 0.00000000000000022
##      
##         10k   15k   20k   25k   35k   50k   75k > 75k
##   yes  2083  2196  2635  3141  3157  4095  4466  7479
##   no   3408  3752  6512  8910 10935 15787 18583 43081
## 
##  Pearson's Chi-squared test
## 
## data:  brfss2017_depression_table
## X-squared = 3625.9, df = 7, p-value < 0.00000000000000022
##      
##         10k   15k   20k   25k   35k   50k   75k > 75k
##   yes   988  1234  1360  1492  1374  1630  1343  1553
##   no   4503  4714  7787 10559 12718 18252 21706 49007
## 
##  Pearson's Chi-squared test
## 
## data:  brfss2017_COPD_table
## X-squared = 4926.9, df = 7, p-value < 0.00000000000000022

Based on visualizaitons and chi-squared tests, it seems that there is an association between income and general health and history of both COPD and depression

Exploring the health outcomes (depression, COPD, general health) using all individual demographic factors only by logistic regression.

Check model accuracy using individual demographics alone

Exploring the health outcomes (depression, COPD, general health) using both individual demographic factors and aggregate metropolitan factors by logistic regression.

check model accuracy using individual demographics and aggregate metro data

##         1         2         3         4         5         6 
## 0.7964293 0.8271381 0.6311824 0.5844652 0.8867128 0.8109481

Adding in the additional variables from ACS median income an housing costs only slightly increased AUC, but not enough to merit including them into GLM models. Overall, the value of individual survey data is much more valuable in predicting history of depression, COPD, and general well-being/health.